Team Members

  • Xingzhi Du
  • Chen Liu
  • Banghao Chi

Introduction

This project explores factors influencing building energy consumption in Southern California. Energy efficiency is a critical concern in modern urban settings, particularly in regions with high electricity demands and varying environmental conditions. By analyzing energy consumption data, we aim to identify patterns and factors that drive electricity usage, offering insights for energy optimization.

Background and Description of the Dataset

  • Source: Kaggle

  • Description:
    The dataset contains hourly electricity usage data for residential, commercial, and industrial buildings in Southern California, spanning from January 2018 to January 2024. It includes over 100 facilities and integrates information from smart meters, IoT sensors, and utility companies. Key metrics include electricity usage, weather conditions, and building characteristics, making it suitable for time-series analysis, energy forecasting, and studying energy efficiency.

  • Dataset Summary:

    • Number of Records: 52,585
    • Number of Variables: 40 in the raw file, 12 of which are retained for this analysis
    • Key Variables:
      • Timestamp: Hourly record of electricity usage.
      • Building Type: Categorical variable (residential, commercial, or industrial).
      • Energy Consumption (kWh): Continuous numeric variable representing total electricity usage.
      • Temperature (°C): Continuous numeric variable reflecting ambient temperature.
      • Solar Irradiance (W/m²): Continuous numeric variable measuring solar energy exposure.
      • HVAC Consumption (kWh): Continuous numeric variable for heating, ventilation, and air conditioning energy usage.
      • Lighting Consumption (kWh): Continuous numeric variable for lighting energy usage.
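      • Peak Demand Reduction Indicator: Binary variable (0/1) indicating whether peak demand reduction was flagged for the record. Participation in demand-response programs is recorded in a separate column, Demand.Response.Participation, which is not used here.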
      • Energy Price ($/kWh): Continuous numeric variable indicating electricity costs.
      • Building Age: Numeric variable indicating building age in years.
      • Building Size: Numeric variable measuring the building’s area in square meters.
      • Carbon Emission Reduction Category: Categorical variable based on sustainability initiatives.
library(ggplot2)
library(dplyr)
data = read.csv("electricity_consumption_optimization_dataset.csv")
str(data)
## 'data.frame':    52585 obs. of  40 variables:
##  $ Timestamp                          : chr  "1/01/2018 0:00" "1/01/2018 1:00" "1/01/2018 2:00" "1/01/2018 3:00" ...
##  $ Building.Type                      : chr  "Residential" "Industrial" "Commercial" "Residential" ...
##  $ Energy.Consumption..kWh.           : num  74.7 46.6 58.8 53.6 37.8 ...
##  $ Temperature                        : num  31.4 30.2 19.2 16.7 29.6 ...
##  $ Humidity....                       : num  62.5 63.1 65 67.4 55.1 ...
##  $ Occupancy.Rate....                 : num  49.3 65 -16.6 27.4 74.2 ...
##  $ Lighting.Consumption..kWh.         : num  9.892 11.064 0.582 3.58 17.824 ...
##  $ HVAC.Consumption..kWh.             : num  9.07 26.49 10.39 8.2 12.27 ...
##  $ Energy.Price....kWh.               : num  0.0533 0.019 0.0603 0.2093 0.2253 ...
##  $ Carbon.Emission.Rate..g.CO2.kWh.   : num  342 427 278 692 621 ...
##  $ Power.Factor                       : num  0.932 0.984 0.801 0.921 0.902 ...
##  $ Voltage.Levels..V.                 : num  238 221 238 225 229 ...
##  $ Reactive.Power..kVARh.             : num  7.98 6.99 3.83 6.6 4.77 ...
##  $ Power.Outage.Indicator             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Indoor.Temperature...C.            : num  25 19.6 17.5 20.3 24.8 ...
##  $ Building.Age..years.               : num  31.8 16.5 34.9 25.3 30 ...
##  $ Equipment.Age..years.              : num  15.36 5.33 7.25 8.24 13.85 ...
##  $ Energy.Efficiency.Rating           : num  50 53.6 64.1 59.2 82.3 ...
##  $ Building.Size.m.2.                 : num  545.8 1308.2 -47.6 348.6 1240.7 ...
##  $ Window.to.Wall.Ratio....           : num  32.3 37.1 22.5 30.2 43.7 ...
##  $ Insulation.Quality.Score           : num  9.88 2.39 6.86 9.46 10.24 ...
##  $ Historical.Energy.Consumption..kWh.: num  81.7 82.3 49.6 3.1 66.1 ...
##  $ Maintenance.Status                 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Demand.Response.Participation      : int  1 0 0 1 0 0 1 0 0 0 ...
##  $ Occupancy.Schedule                 : chr  "Occupied" "Occupied" "Vacant" "Occupied" ...
##  $ Local.Energy.Production..kWh.      : num  6.44 6.45 7.23 11.4 9.9 ...
##  $ Grid.Stability.Score               : num  80.7 91.5 72.9 91.3 78.2 ...
##  $ Solar.Irradiance                   : num  159 295 135 240 325 ...
##  $ Smart.Plug.Usage..kWh.             : num  0.3185 0.0998 0.0192 0.3154 0.2355 ...
##  $ Water.Usage..liters.               : num  102.11 178.5 104.65 8.94 99.79 ...
##  $ Energy.Savings.Target....          : num  16.2 16.1 17.5 14.1 21 ...
##  $ Room.Level.Energy.Consumption..kWh.: num  13.3 11.8 24.3 24.6 24.5 ...
##  $ Zonal.Heating.Cooling.Data..kWh.   : num  6.72 7.04 12.87 9.53 10.04 ...
##  $ Electric.Vehicle.Charging.Status   : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ Building.Orientation               : chr  "South" "South" "North" "South" ...
##  $ IoT.Sensor.Count                   : num  21.4 34.4 67.6 37.7 36.3 ...
##  $ Thermal.Comfort.Index              : num  80.8 79.7 84.6 96 68.2 ...
##  $ Energy.Savings.Potential....       : num  14.12 4.11 4.13 8.68 19.63 ...
##  $ Peak.Demand.Reduction.Indicator    : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Carbon.Emission.Reduction.Category : chr  "Moderate Reduction" "Moderate Reduction" "Moderate Reduction" "No Reduction" ...

Objectives

This analysis aims to:

  1. Identify the key factors that significantly impact building energy consumption.
  2. Develop a predictive model for energy usage based on available data.
  3. Provide actionable insights for optimizing energy strategies in Southern California buildings.

Statement of Interest

In an era of increasing focus on climate change and energy efficiency, understanding the factors that influence building energy consumption is crucial for sustainable urban development. This dataset offers a valuable opportunity to uncover patterns and develop strategies for optimizing energy usage in Southern California, contributing to more sustainable and efficient energy management practices.

Methods

Data Acquisition and Cleaning

The raw dataset was first loaded and cleaned to prepare it for analysis. The following steps were performed:

  1. Filtering for Full Year 2023:
    • The dataset was filtered to include only records from January 1, 2023, to December 31, 2023, based on the Timestamp column.
  2. Selecting Relevant Columns:
    • Only the 12 columns relevant to the energy consumption analysis were retained, including building type, energy consumption, temperature, and solar irradiance.
  3. Random Sampling:
    • The filtered dataset was randomly sampled down to 2,000 rows (with a fixed seed) to keep the analysis manageable.
  4. Removing Negative Values:
    • Numeric columns that cannot be negative (e.g., Energy Consumption, Solar Irradiance, Building Size) were checked for negative values, and rows containing them were removed, leaving 1,842 rows.
  5. Saving Cleaned Data:
    • The sampled and cleaned datasets were each saved to a CSV file for further analysis.

Next, we acquire the data and then clean it.

Data Acquisition

df = read.csv("electricity_consumption_optimization_dataset.csv")
head(df)
##        Timestamp Building.Type Energy.Consumption..kWh. Temperature
## 1 1/01/2018 0:00   Residential                    74.68       31.36
## 2 1/01/2018 1:00    Industrial                    46.59       30.23
## 3 1/01/2018 2:00    Commercial                    58.84       19.18
## 4 1/01/2018 3:00   Residential                    53.59       16.70
## 5 1/01/2018 4:00   Residential                    37.80       29.62
## 6 1/01/2018 5:00   Residential                    62.58       27.28
##   Humidity.... Occupancy.Rate.... Lighting.Consumption..kWh.
## 1        62.47              49.29                     9.8921
## 2        63.07              65.04                    11.0637
## 3        65.03             -16.60                     0.5823
## 4        67.41              27.40                     3.5800
## 5        55.07              74.22                    17.8236
## 6        73.05              68.00                    10.2164
##   HVAC.Consumption..kWh. Energy.Price....kWh. Carbon.Emission.Rate..g.CO2.kWh.
## 1                  9.073              0.05330                            341.8
## 2                 26.488              0.01903                            427.3
## 3                 10.386              0.06028                            278.1
## 4                  8.200              0.20932                            691.9
## 5                 12.268              0.22533                            620.5
## 6                 11.016              0.21826                            631.3
##   Power.Factor Voltage.Levels..V. Reactive.Power..kVARh. Power.Outage.Indicator
## 1       0.9321              237.5                  7.982                      0
## 2       0.9842              221.2                  6.990                      0
## 3       0.8007              237.6                  3.826                      0
## 4       0.9213              224.6                  6.601                      0
## 5       0.9016              229.2                  4.767                      0
## 6       0.8723              232.4                  6.613                      0
##   Indoor.Temperature...C. Building.Age..years. Equipment.Age..years.
## 1                  24.998                31.77                15.357
## 2                  19.593                16.46                 5.326
## 3                  17.459                34.93                 7.254
## 4                  20.344                25.27                 8.244
## 5                  24.778                30.00                13.851
## 6                   9.747                33.51                21.895
##   Energy.Efficiency.Rating Building.Size.m.2. Window.to.Wall.Ratio....
## 1                    50.02             545.82                    32.26
## 2                    53.58            1308.15                    37.11
## 3                    64.05             -47.62                    22.45
## 4                    59.22             348.65                    30.16
## 5                    82.28            1240.68                    43.70
## 6                    65.58            1336.67                    35.58
##   Insulation.Quality.Score Historical.Energy.Consumption..kWh.
## 1                    9.877                               81.70
## 2                    2.385                               82.27
## 3                    6.860                               49.56
## 4                    9.456                                3.10
## 5                   10.239                               66.12
## 6                    7.481                               37.22
##   Maintenance.Status Demand.Response.Participation Occupancy.Schedule
## 1                  0                             1           Occupied
## 2                  0                             0           Occupied
## 3                  0                             0             Vacant
## 4                  0                             1           Occupied
## 5                  0                             0           Occupied
## 6                  0                             0           Occupied
##   Local.Energy.Production..kWh. Grid.Stability.Score Solar.Irradiance
## 1                         6.437                80.67            159.0
## 2                         6.454                91.47            294.8
## 3                         7.226                72.85            134.9
## 4                        11.395                91.34            239.7
## 5                         9.901                78.19            325.3
## 6                         6.724                75.82            145.2
##   Smart.Plug.Usage..kWh. Water.Usage..liters. Energy.Savings.Target....
## 1                0.31852              102.114                     16.21
## 2                0.09984              178.497                     16.08
## 3                0.01916              104.648                     17.46
## 4                0.31539                8.939                     14.13
## 5                0.23552               99.788                     20.96
## 6                0.12821              125.938                     15.68
##   Room.Level.Energy.Consumption..kWh. Zonal.Heating.Cooling.Data..kWh.
## 1                               13.34                            6.720
## 2                               11.75                            7.041
## 3                               24.30                           12.874
## 4                               24.59                            9.527
## 5                               24.48                           10.040
## 6                               33.10                           21.046
##   Electric.Vehicle.Charging.Status Building.Orientation IoT.Sensor.Count
## 1                                0                South            21.43
## 2                                1                South            34.39
## 3                                0                North            67.59
## 4                                0                South            37.71
## 5                                0                North            36.32
## 6                                0                 East            58.87
##   Thermal.Comfort.Index Energy.Savings.Potential....
## 1                 80.81                       14.115
## 2                 79.68                        4.108
## 3                 84.57                        4.131
## 4                 95.95                        8.682
## 5                 68.23                       19.632
## 6                 65.30                        5.696
##   Peak.Demand.Reduction.Indicator Carbon.Emission.Reduction.Category
## 1                               0                 Moderate Reduction
## 2                               0                 Moderate Reduction
## 3                               0                 Moderate Reduction
## 4                               0                       No Reduction
## 5                               0                 Moderate Reduction
## 6                               0                      Low Reduction
names(df)
##  [1] "Timestamp"                           "Building.Type"                      
##  [3] "Energy.Consumption..kWh."            "Temperature"                        
##  [5] "Humidity...."                        "Occupancy.Rate...."                 
##  [7] "Lighting.Consumption..kWh."          "HVAC.Consumption..kWh."             
##  [9] "Energy.Price....kWh."                "Carbon.Emission.Rate..g.CO2.kWh."   
## [11] "Power.Factor"                        "Voltage.Levels..V."                 
## [13] "Reactive.Power..kVARh."              "Power.Outage.Indicator"             
## [15] "Indoor.Temperature...C."             "Building.Age..years."               
## [17] "Equipment.Age..years."               "Energy.Efficiency.Rating"           
## [19] "Building.Size.m.2."                  "Window.to.Wall.Ratio...."           
## [21] "Insulation.Quality.Score"            "Historical.Energy.Consumption..kWh."
## [23] "Maintenance.Status"                  "Demand.Response.Participation"      
## [25] "Occupancy.Schedule"                  "Local.Energy.Production..kWh."      
## [27] "Grid.Stability.Score"                "Solar.Irradiance"                   
## [29] "Smart.Plug.Usage..kWh."              "Water.Usage..liters."               
## [31] "Energy.Savings.Target...."           "Room.Level.Energy.Consumption..kWh."
## [33] "Zonal.Heating.Cooling.Data..kWh."    "Electric.Vehicle.Charging.Status"   
## [35] "Building.Orientation"                "IoT.Sensor.Count"                   
## [37] "Thermal.Comfort.Index"               "Energy.Savings.Potential...."       
## [39] "Peak.Demand.Reduction.Indicator"     "Carbon.Emission.Reduction.Category"
# Load required libraries
library(lubridate)

# Convert Timestamp to proper datetime format
df$Timestamp = dmy_hm(df$Timestamp)  # This handles dd/mm/yyyy HH:MM format

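# Filter for the full year 2023 and keep only the 12 analysis columns
# Note: dmy_hm() returns UTC times, while as.POSIXct() without tz uses the session time zone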
df_2023 = df %>%
  filter(Timestamp >= as.POSIXct("2023-01-01 00:00:00") & 
         Timestamp <= as.POSIXct("2023-12-31 23:59:59")) %>%
  select(
    'Timestamp',
    'Building.Type',
    'Energy.Consumption..kWh.',
    'Temperature',
    'Solar.Irradiance',
    'HVAC.Consumption..kWh.',
    'Lighting.Consumption..kWh.',
    'Peak.Demand.Reduction.Indicator',
    'Energy.Price....kWh.',
    'Building.Age..years.',
    'Building.Size.m.2.',
    'Carbon.Emission.Reduction.Category'
  )

# set seed to ensure reproducibility
set.seed(123)
final_df = df_2023 %>%
  slice_sample(n = 2000)

# Save the filtered data to a new CSV file
write.csv(final_df, "energy_data_20231.csv", row.names = FALSE)

# Check the date range in the final dataset
range(final_df$Timestamp)
## [1] "2023-01-01 06:00:00 UTC" "2023-12-31 22:00:00 UTC"
# Check dimensions of the new dataset
dim(final_df)
## [1] 2000   12

Data cleaning

# Load the sampled 2023 data saved above
energy_data = read.csv("energy_data_20231.csv")

# Convert negative values to NA for numeric columns that cannot be negative
# (Temperature can legitimately be below zero; Peak.Demand.Reduction.Indicator is binary)
clean_data = energy_data %>%
  mutate(
    Energy.Consumption..kWh. = ifelse(Energy.Consumption..kWh. < 0, NA, Energy.Consumption..kWh.),
    Solar.Irradiance = ifelse(Solar.Irradiance < 0, NA, Solar.Irradiance),
    HVAC.Consumption..kWh. = ifelse(HVAC.Consumption..kWh. < 0, NA, HVAC.Consumption..kWh.),
    Lighting.Consumption..kWh. = ifelse(Lighting.Consumption..kWh. < 0, NA, Lighting.Consumption..kWh.),
    Energy.Price....kWh. = ifelse(Energy.Price....kWh. < 0, NA, Energy.Price....kWh.),
    Building.Age..years. = ifelse(Building.Age..years. < 0, NA, Building.Age..years.),
    Building.Size.m.2. = ifelse(Building.Size.m.2. < 0, NA, Building.Size.m.2.)
  )

# Remove rows with any NA values
final_clean_data = clean_data %>%
  na.omit()

# Ensure the output column names are consistent
final_clean_data = final_clean_data %>%
  select(
    Timestamp,
    Building.Type,
    Energy.Consumption..kWh.,
    Temperature,
    Solar.Irradiance,
    HVAC.Consumption..kWh.,
    Lighting.Consumption..kWh.,
    Peak.Demand.Reduction.Indicator,
    Energy.Price....kWh.,
    Building.Age..years.,
    Building.Size.m.2.,
    Carbon.Emission.Reduction.Category
  )

# Save the cleaned data
write.csv(final_clean_data, "energy_data_2023_clean.csv", row.names = FALSE)

# Check how many rows were removed
cat("Original number of rows:", nrow(energy_data), "\n")
## Original number of rows: 2000
cat("Number of rows after removing NA values:", nrow(final_clean_data), "\n")
## Number of rows after removing NA values: 1842
cat("Number of rows removed:", nrow(energy_data) - nrow(final_clean_data), "\n")
## Number of rows removed: 158
# View summary statistics for all columns
summary(final_clean_data)
##   Timestamp         Building.Type      Energy.Consumption..kWh.  Temperature  
##  Length:1842        Length:1842        Min.   :  1.47           Min.   :-7.7  
##  Class :character   Class :character   1st Qu.: 40.81           1st Qu.:14.4  
##  Mode  :character   Mode  :character   Median : 54.16           Median :21.1  
##                                        Mean   : 54.30           Mean   :21.1  
##                                        3rd Qu.: 67.91           3rd Qu.:27.8  
##                                        Max.   :119.42           Max.   :62.7  
##  Solar.Irradiance HVAC.Consumption..kWh. Lighting.Consumption..kWh.
##  Min.   :  0.1    Min.   : 0.07          Min.   : 0.087            
##  1st Qu.:151.9    1st Qu.:11.67          1st Qu.: 7.902            
##  Median :220.3    Median :16.60          Median :11.316            
##  Mean   :223.9    Mean   :16.35          Mean   :11.320            
##  3rd Qu.:292.1    3rd Qu.:20.87          3rd Qu.:14.615            
##  Max.   :527.7    Max.   :39.58          Max.   :28.845            
##  Peak.Demand.Reduction.Indicator Energy.Price....kWh. Building.Age..years.
##  Min.   :0.000                   Min.   :0.002        Min.   : 0.2        
##  1st Qu.:0.000                   1st Qu.:0.123        1st Qu.:15.0        
##  Median :0.000                   Median :0.156        Median :21.6        
##  Mean   :0.151                   Mean   :0.155        Mean   :22.0        
##  3rd Qu.:0.000                   3rd Qu.:0.188        3rd Qu.:28.5        
##  Max.   :1.000                   Max.   :0.331        Max.   :51.7        
##  Building.Size.m.2. Carbon.Emission.Reduction.Category
##  Min.   :  23.9     Length:1842                       
##  1st Qu.: 805.0     Class :character                  
##  Median :1158.7     Mode  :character                  
##  Mean   :1156.6                                       
##  3rd Qu.:1503.4                                       
##  Max.   :2654.9
# Check dimensions of final dataset
dim(final_clean_data)
## [1] 1842   12
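
The column-by-column `ifelse()` calls above could also be written once with `across()`. The following is a minimal sketch on a small made-up data frame; the `toy` object, its values, and the `non_negative_cols` vector are illustrative assumptions, not rows or names taken from the analysis. Temperature is deliberately left out of the screened columns because sub-zero readings are valid.

```r
library(dplyr)

# Toy data (hypothetical values) standing in for the sampled 2023 records
toy = data.frame(
  Energy.Consumption..kWh. = c(74.7, -3.2, 58.8, 53.6),
  Temperature              = c(-7.7, 30.2, 19.2, 16.7),
  Building.Size.m.2.       = c(545.8, 1308.2, -47.6, 348.6)
)

# Columns in which a negative value is physically impossible (assumed list)
non_negative_cols = c("Energy.Consumption..kWh.", "Building.Size.m.2.")

toy_clean = toy %>%
  # Replace negatives with NA in the listed columns only
  mutate(across(all_of(non_negative_cols), ~ ifelse(.x < 0, NA, .x))) %>%
  # Drop every row that now contains an NA
  na.omit()

nrow(toy_clean)
```

Listing the screened columns in one vector keeps the rule in a single place, so adding or removing a column does not require editing the `mutate()` call.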

Statistical Analysis

To understand the characteristics of the data and relationships between variables, we performed the following analyses:

  1. Descriptive Statistics:
    • Summary statistics were calculated for all numeric variables to understand the central tendency and variability.
    • Standard deviation was computed to measure data dispersion.
# Descriptive statistics
summary(final_clean_data)
##   Timestamp         Building.Type      Energy.Consumption..kWh.  Temperature  
##  Length:1842        Length:1842        Min.   :  1.47           Min.   :-7.7  
##  Class :character   Class :character   1st Qu.: 40.81           1st Qu.:14.4  
##  Mode  :character   Mode  :character   Median : 54.16           Median :21.1  
##                                        Mean   : 54.30           Mean   :21.1  
##                                        3rd Qu.: 67.91           3rd Qu.:27.8  
##                                        Max.   :119.42           Max.   :62.7  
##  Solar.Irradiance HVAC.Consumption..kWh. Lighting.Consumption..kWh.
##  Min.   :  0.1    Min.   : 0.07          Min.   : 0.087            
##  1st Qu.:151.9    1st Qu.:11.67          1st Qu.: 7.902            
##  Median :220.3    Median :16.60          Median :11.316            
##  Mean   :223.9    Mean   :16.35          Mean   :11.320            
##  3rd Qu.:292.1    3rd Qu.:20.87          3rd Qu.:14.615            
##  Max.   :527.7    Max.   :39.58          Max.   :28.845            
##  Peak.Demand.Reduction.Indicator Energy.Price....kWh. Building.Age..years.
##  Min.   :0.000                   Min.   :0.002        Min.   : 0.2        
##  1st Qu.:0.000                   1st Qu.:0.123        1st Qu.:15.0        
##  Median :0.000                   Median :0.156        Median :21.6        
##  Mean   :0.151                   Mean   :0.155        Mean   :22.0        
##  3rd Qu.:0.000                   3rd Qu.:0.188        3rd Qu.:28.5        
##  Max.   :1.000                   Max.   :0.331        Max.   :51.7        
##  Building.Size.m.2. Carbon.Emission.Reduction.Category
##  Min.   :  23.9     Length:1842                       
##  1st Qu.: 805.0     Class :character                  
##  Median :1158.7     Mode  :character                  
##  Mean   :1156.6                                       
##  3rd Qu.:1503.4                                       
##  Max.   :2654.9
# Standard deviation for numeric columns
numeric_columns = final_clean_data %>%
  select(where(is.numeric))
sapply(numeric_columns, sd, na.rm = TRUE)
##        Energy.Consumption..kWh.                     Temperature 
##                        19.47877                        10.00629 
##                Solar.Irradiance          HVAC.Consumption..kWh. 
##                       100.21482                         6.64073 
##      Lighting.Consumption..kWh. Peak.Demand.Reduction.Indicator 
##                         4.89607                         0.35860 
##            Energy.Price....kWh.            Building.Age..years. 
##                         0.05033                         9.53429 
##              Building.Size.m.2. 
##                       492.26153

Dataset Splitting

The dataset was divided into a training set (80%) and a testing set (20%) to ensure robust evaluation of model performance.

# Set seed for reproducibility
set.seed(123)

# Split the data into training (80%) and testing (20%) sets
train_indices = sample(1:nrow(final_clean_data), size = 0.8 * nrow(final_clean_data))
train_data = final_clean_data[train_indices, ]
test_data = final_clean_data[-train_indices, ]

# Check dimensions of the splits
dim(train_data)  # Training set
## [1] 1473   12
dim(test_data)   # Testing set
## [1] 369  12

After splitting the dataset, the training set will be used to explore various modeling approaches, including simple and multiple linear regression, interaction terms, and transformations. The testing set will be used to evaluate model performance using metrics such as RMSE and R-squared, ensuring the model’s applicability to unseen data.
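
As a rough illustration of how RMSE and R-squared could be computed on a held-out set, the sketch below fits a simple linear model to simulated data. The helper functions `rmse()` and `r_squared()` and the `sim` data frame are hypothetical and are not part of the analysis above.

```r
# Synthetic data (illustrative only): y depends linearly on x plus noise
set.seed(42)
n = 200
sim = data.frame(x = runif(n, 0, 10))
sim$y = 3 + 2 * sim$x + rnorm(n, sd = 1)

# 80/20 split, mirroring the approach used above
train_idx = sample(seq_len(n), size = floor(0.8 * n))
sim_train = sim[train_idx, ]
sim_test  = sim[-train_idx, ]

# Hypothetical helpers for the two evaluation metrics
rmse = function(actual, predicted) sqrt(mean((actual - predicted)^2))
r_squared = function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Fit on the training rows only, then score on the unseen test rows
fit  = lm(y ~ x, data = sim_train)
pred = predict(fit, newdata = sim_test)

test_rmse = rmse(sim_test$y, pred)
test_r2   = r_squared(sim_test$y, pred)
c(RMSE = test_rmse, R2 = test_r2)
```

Computing both metrics from test-set predictions, rather than reading R-squared from `summary(fit)`, keeps the evaluation tied to data the model has not seen.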

Distribution Analysis

Histograms and density plots were created to examine the distribution of key variables. Boxplots were used to explore the relationship between the categorical variable Building.Type and the response variable Energy.Consumption..kWh.. Together, these plots give a first picture of how the energy consumption data are distributed.

# Histogram
ggplot(final_clean_data, aes(x = Energy.Consumption..kWh.)) +
  geom_histogram(bins = 30, fill = "blue", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Energy Consumption", x = "Energy Consumption (kWh)", y = "Frequency")

The energy consumption data follow an approximately normal distribution with a slight right skew. The peak occurs around 50-60 kWh, and values range from about 1 to 119 kWh. Most records fall between 20 and 100 kWh, with a small number of observations in the upper tail.

# Density plot for Temperature
ggplot(final_clean_data, aes(x = Temperature)) +
  geom_density(fill = "green", alpha = 0.5) +
  labs(title = "Density Plot of Temperature", x = "Temperature (°C)", y = "Density")

The temperature distribution shows a slight right skew, with values ranging from about -8°C to 63°C and a peak at around 20°C. Most temperatures fall between 0°C and 40°C, though some higher readings are observed.

# Boxplot for Building Type vs Energy Consumption
ggplot(final_clean_data, aes(x = Building.Type, y = Energy.Consumption..kWh., fill = Building.Type)) +
  geom_boxplot() +
  labs(title = "Energy Consumption by Building Type", x = "Building Type", y = "Energy Consumption (kWh)")

The boxplot compares energy consumption across three building types: commercial, industrial, and residential. All three categories show similar median consumption of around 50-55 kWh, suggesting that typical energy consumption is much the same across the three building types. The spread of consumption is also comparable across types, though residential buildings exhibit more outliers at higher consumption levels. The interquartile ranges span roughly 40-70 kWh for all building types.
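
The medians and interquartile ranges read off a boxplot can also be tabulated directly. The sketch below does so on simulated values; the `toy` data frame is an assumption for illustration and does not reproduce the figures quoted above.

```r
library(dplyr)

# Simulated consumption values (illustrative only), 50 per building type
set.seed(5)
toy = data.frame(
  Building.Type = rep(c("Commercial", "Industrial", "Residential"), each = 50),
  Energy.Consumption..kWh. = rnorm(150, mean = 54, sd = 19)
)

# Per-group count, median, and quartiles: the numbers a boxplot draws
by_type = toy %>%
  group_by(Building.Type) %>%
  summarise(
    n      = n(),
    median = median(Energy.Consumption..kWh.),
    q1     = quantile(Energy.Consumption..kWh., 0.25, names = FALSE),
    q3     = quantile(Energy.Consumption..kWh., 0.75, names = FALSE)
  )
by_type
```

Applied to the cleaned data, the same `group_by()` and `summarise()` calls would give exact values to set beside the visual comparison.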

Correlation Analysis Results

  • In the pairs plot, strong correlations should appear as a clear trend (e.g., points aligned along a line, sloping upwards for positive correlation).

  • However, in the scatterplots involving the response variable, the points appear to form random clouds rather than a strong trend.

  • The corresponding correlation coefficients (small values close to zero in the upper triangle) and the correlation matrix confirm that no variables show a strong correlation with the response variable.

library(faraway)
# Select numeric columns for analysis
numeric_columns = final_clean_data %>%
  select(Energy.Consumption..kWh.,
         Temperature,
         Solar.Irradiance,
         HVAC.Consumption..kWh.,
         Lighting.Consumption..kWh.,
         Energy.Price....kWh.,
         Building.Age..years.,
         Building.Size.m.2.)

# Custom panel.cor function to add correlation coefficients
panel.cor = function(x, y, digits = 2, prefix = "", cex.cor = 0.8, ...) {
    usr = par("usr")
    on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r = cor(x, y, use = "complete.obs")
    txt = format(c(r, 0.123456789), digits = digits)[1]
    txt = paste0(prefix, txt)
    text(0.5, 0.5, txt, cex = cex.cor)
}
# Generate the pairs plot in the Results section
pairs(numeric_columns,
      col = "dodgerblue",
      pch = 16,
      cex = 0.5,
      gap = 0,
      upper.panel = panel.cor,
      lower.panel = panel.smooth)
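
To list the correlations with the response numerically rather than reading them off the plot, a small helper along the following lines could be used. This is a sketch: `cor_with_response()` is a hypothetical function and `demo` is simulated data, neither of which appears in the original analysis.

```r
# Hypothetical helper: correlation of every numeric column with a response
cor_with_response = function(df, response) {
  # Keep numeric columns only; character columns are ignored
  num = df[sapply(df, is.numeric)]
  cors = cor(num, use = "complete.obs")[, response]
  # Drop the response's self-correlation and sort by absolute strength
  cors = cors[names(cors) != response]
  cors[order(abs(cors), decreasing = TRUE)]
}

# Synthetic example (illustrative values only)
set.seed(1)
demo = data.frame(a = rnorm(100), b = rnorm(100))
demo$y = 2 * demo$a + rnorm(100, sd = 0.5)
demo$label = "x"   # a non-numeric column, skipped by the helper

res = cor_with_response(demo, "y")
res
```

On the cleaned data the call would be `cor_with_response(final_clean_data, "Energy.Consumption..kWh.")`. Note that it would also include the binary Peak.Demand.Reduction.Indicator, since that column is stored as a number.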

Training and Testing Set Results

The dataset was successfully split into training and testing sets to prepare for model building and validation:

  • Training Set: 80% of the data, containing 1473 rows.
  • Testing Set: 20% of the data, containing 369 rows.

The following histogram illustrates the distribution of Energy.Consumption..kWh. in both the training and testing sets. The distributions are similar, indicating that the split has maintained the representativeness of the original dataset.

We also compare the means and standard deviations of the numeric variables between the training and testing sets.

# Combine training and testing sets for visualization
train_data$Set = "Training Set"
test_data$Set = "Testing Set"
combined_data = rbind(train_data, test_data)

# Histogram comparison
ggplot(combined_data, aes(x = Energy.Consumption..kWh., fill = Set)) +
  geom_histogram(bins = 30, alpha = 0.7, position = "identity", color = "black") +
  labs(title = "Energy Consumption Distribution in Training vs. Testing Sets",
       x = "Energy Consumption (kWh)", y = "Frequency") +
  scale_fill_manual(name = "Dataset", values = c("Training Set" = "lightblue", "Testing Set" = "red")) +
  theme_minimal()

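# Pass na.rm inside the functions; supplying it through across()'s `...` is deprecated in dplyr >= 1.1.0
stats = list(mean = function(x) mean(x, na.rm = TRUE), sd = function(x) sd(x, na.rm = TRUE))
train_summary = train_data %>% summarise(across(where(is.numeric), stats))
test_summary = test_data %>% summarise(across(where(is.numeric), stats))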
print(train_summary)
##   Energy.Consumption..kWh._mean Energy.Consumption..kWh._sd Temperature_mean
## 1                          54.2                       19.53            21.05
##   Temperature_sd Solar.Irradiance_mean Solar.Irradiance_sd
## 1          10.04                 222.8               100.4
##   HVAC.Consumption..kWh._mean HVAC.Consumption..kWh._sd
## 1                       16.28                     6.688
##   Lighting.Consumption..kWh._mean Lighting.Consumption..kWh._sd
## 1                           11.31                         4.935
##   Peak.Demand.Reduction.Indicator_mean Peak.Demand.Reduction.Indicator_sd
## 1                               0.1385                             0.3455
##   Energy.Price....kWh._mean Energy.Price....kWh._sd Building.Age..years._mean
## 1                    0.1538                 0.05037                     22.08
##   Building.Age..years._sd Building.Size.m.2._mean Building.Size.m.2._sd
## 1                   9.631                    1163                 490.5
print(test_summary)
##   Energy.Consumption..kWh._mean Energy.Consumption..kWh._sd Temperature_mean
## 1                          54.7                       19.31            21.28
##   Temperature_sd Solar.Irradiance_mean Solar.Irradiance_sd
## 1          9.877                 228.1               99.38
##   HVAC.Consumption..kWh._mean HVAC.Consumption..kWh._sd
## 1                       16.63                     6.452
##   Lighting.Consumption..kWh._mean Lighting.Consumption..kWh._sd
## 1                           11.37                         4.742
##   Peak.Demand.Reduction.Indicator_mean Peak.Demand.Reduction.Indicator_sd
## 1                               0.2033                              0.403
##   Energy.Price....kWh._mean Energy.Price....kWh._sd Building.Age..years._mean
## 1                    0.1594                 0.04997                     21.61
##   Building.Age..years._sd Building.Size.m.2._mean Building.Size.m.2._sd
## 1                    9.14                    1130                 499.1

Discussion of the Preparation Process

Statistical Analysis

The initial descriptive statistics provided insights into the central tendencies and variabilities of the key variables in the dataset. Specifically:

Energy Consumption (kWh): The mean energy consumption was approximately 54.3 kWh, with a standard deviation of 19.48 kWh, indicating moderate variability in energy usage across the analyzed buildings.

Temperature (°C): The average temperature was 21.1°C, with a standard deviation of 10.01°C. This highlights the diverse climate conditions in Southern California over the analyzed period.

HVAC and Lighting Consumption: These two predictors had means of 16.35 kWh and 11.32 kWh, respectively, which together amount to about half of the mean total consumption. Their standard deviations are smaller than those of energy consumption and temperature, suggesting consistent contributions to overall energy consumption.

These statistics provided a foundational understanding of the dataset and highlighted variables likely to have significant effects on energy consumption.

Possible Explanations for Low Correlation

  1. Non-linear Relationships:
    • Variables like Temperature may have non-linear effects (e.g., higher energy consumption during extreme weather).
  2. Interaction Effects:
    • Certain predictors (e.g., HVAC.Consumption..kWh. and Temperature) might interact, amplifying their combined impact.
  3. Uncaptured Variables:
    • Missing factors such as building occupancy or time-of-day effects could mask true relationships.

Implications for Model Building

Addressing Non-linearities

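  • Apply transformations (e.g., log or square root) to variables like Solar.Irradiance, or polynomial terms for Temperature, to capture non-linear trends. Log and square-root transformations require positive (or non-negative) values, so they may not apply directly to Temperature measured in °C.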
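
As a minimal sketch of how a non-linear temperature effect could be checked, the block below compares a straight-line fit with a quadratic fit using a nested F-test. The data are simulated and the U-shaped response is an assumption made only for illustration; `sim`, `fit_linear`, and `fit_quad` are hypothetical names, not objects from this report.

```r
# Simulated stand-in data; the values are made up for illustration only.
set.seed(42)
n <- 500
sim <- data.frame(Temperature = runif(n, 0, 40))
# Assumed U-shaped response: usage rises as temperature moves away from 20 degrees C.
sim$Energy <- 50 + 0.05 * (sim$Temperature - 20)^2 + rnorm(n, sd = 5)

fit_linear <- lm(Energy ~ Temperature, data = sim)
fit_quad   <- lm(Energy ~ poly(Temperature, 2), data = sim)

# Nested F-test: does adding the quadratic term reduce the residual sum of squares?
cmp <- anova(fit_linear, fit_quad)
print(cmp)
```

On the report's data the same comparison would use `Energy.Consumption..kWh.` as the response and `train_data` as the data frame. A large p-value in the second row would indicate that the quadratic term adds nothing.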

Investigating Interaction Effects

  • Explore interactions (e.g., HVAC.Consumption..kWh. × Temperature) to account for combined effects on energy usage.

Refining Variable Selection

  • Use methods like AIC/BIC or LASSO to identify the most significant predictors for the model.
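
The AIC/BIC selection mentioned above could be sketched with base R's `step()`. This is an illustration on simulated data in which only `x1` truly matters; the variable names are hypothetical. Setting `k = log(n)` switches the penalty from AIC to BIC.

```r
# Simulated data: four candidate predictors, only x1 affects y (by construction).
set.seed(1)
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
sim$y <- 10 + 2 * sim$x1 + rnorm(n)

full <- lm(y ~ x1 + x2 + x3 + x4, data = sim)

# Backward elimination; k = 2 (the default) gives AIC, k = log(n) gives BIC.
sel_aic <- step(full, direction = "backward", trace = 0)
sel_bic <- step(full, direction = "backward", trace = 0, k = log(n))

print(formula(sel_aic))
print(formula(sel_bic))
```

Stepwise selection is sensitive to noise, so on data with signal as weak as in this report it may well retain no predictors at all.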

Dataset Splitting Analysis

The data were split into training and testing sets with an 80/20 ratio, maintaining the overall distribution of Energy.Consumption..kWh. in both. Means and standard deviations are close across the two sets for most variables, which supports their representativeness; the main exception is Peak.Demand.Reduction.Indicator, whose mean is 0.139 in the training set and 0.203 in the testing set. The training set therefore provides a reasonable basis for model building, while the testing set allows an unbiased evaluation of performance.
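
One way to quantify how well the two sets agree is a standardized mean difference (difference in means divided by a pooled standard deviation). The sketch below applies it to three of the printed summaries above. The helper `smd_from_summary` is a hypothetical name, and the 0.1 cut-off often used to flag imbalance is a rule of thumb, not something taken from this report.

```r
# Standardized mean difference from summary statistics alone.
smd_from_summary <- function(m1, s1, m2, s2) {
  (m1 - m2) / sqrt((s1^2 + s2^2) / 2)
}

# Means and SDs copied from the train_summary and test_summary output above.
printed <- data.frame(
  variable   = c("Energy.Consumption..kWh.", "Temperature",
                 "Peak.Demand.Reduction.Indicator"),
  train_mean = c(54.2, 21.05, 0.1385),
  train_sd   = c(19.53, 10.04, 0.3455),
  test_mean  = c(54.7, 21.28, 0.2033),
  test_sd    = c(19.31, 9.877, 0.403)
)

printed$smd <- with(printed,
                    smd_from_summary(train_mean, train_sd, test_mean, test_sd))
print(printed[, c("variable", "smd")], digits = 2)
```

Energy consumption and temperature differ by only a few hundredths of a standard deviation, whereas the peak-demand indicator differs by noticeably more.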

Results

Simple Linear Regression Analysis

Based on our correlation analysis from earlier sections, we’ll begin by exploring the relationship between Energy Consumption and HVAC Consumption using simple linear regression. This will help us understand the baseline relationship before moving to more complex models.

SLR with HVAC Consumption

# Fit SLR model using HVAC consumption as predictor
slr_model = lm(Energy.Consumption..kWh. ~ HVAC.Consumption..kWh., data = train_data)
summary(slr_model)
## 
## Call:
## lm(formula = Energy.Consumption..kWh. ~ HVAC.Consumption..kWh., 
##     data = train_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -53.16 -13.41  -0.14  13.51  64.65 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             55.1341     1.3398   41.15   <2e-16 ***
## HVAC.Consumption..kWh.  -0.0575     0.0761   -0.75     0.45    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.5 on 1471 degrees of freedom
## Multiple R-squared:  0.000387,   Adjusted R-squared:  -0.000292 
## F-statistic: 0.57 on 1 and 1471 DF,  p-value: 0.45

The simple linear regression results show a surprisingly weak relationship between HVAC consumption and total energy consumption:

  • The coefficient for HVAC consumption (-0.0575) is not statistically significant (p-value = 0.45)
  • The extremely low R-squared value (0.000387) indicates that HVAC consumption alone explains virtually none of the variance in total energy consumption
  • The F-statistic (0.57) with a p-value of 0.45 suggests this model is no better than an intercept-only model (a horizontal line at the mean)
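
The overall F-test in a simple linear regression is the same as comparing the model against an intercept-only fit, and the F-statistic equals the square of the slope's t-statistic (here, -0.75 squared is about 0.57 after rounding). A minimal sketch on simulated data, with hypothetical object names:

```r
# Simulated data in which y is unrelated to x by construction.
set.seed(7)
n <- 200
sim <- data.frame(x = rnorm(n))
sim$y <- 5 + rnorm(n)

null_fit <- lm(y ~ 1, data = sim)   # horizontal line at the mean of y
slr_fit  <- lm(y ~ x, data = sim)

f_from_summary <- summary(slr_fit)$fstatistic[["value"]]
f_from_anova   <- anova(null_fit, slr_fit)$F[2]
t_slope        <- coef(summary(slr_fit))["x", "t value"]

# All three numbers agree up to floating-point error.
print(c(summary = f_from_summary, anova = f_from_anova, t_squared = t_slope^2))
```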

Let’s visualize this relationship:

# Plot the relationship
ggplot(train_data, aes(x = HVAC.Consumption..kWh., y = Energy.Consumption..kWh.)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Energy Consumption vs HVAC Consumption",
       x = "HVAC Consumption (kWh)",
       y = "Energy Consumption (kWh)")

The scatter plot confirms our statistical findings: the regression line is nearly flat and the points are widely scattered. Either the relationship between HVAC consumption and total energy consumption is not linear, or other factors are more important in determining total energy consumption.

SLR with Temperature

After examining HVAC consumption, we’ll investigate the relationship between temperature and energy consumption, as temperature is often considered a key driver of building energy use through its impact on heating and cooling needs.

# SLR with Temperature
slr_temp = lm(Energy.Consumption..kWh. ~ Temperature, data = train_data)
summary(slr_temp)
## 
## Call:
## lm(formula = Energy.Consumption..kWh. ~ Temperature, data = train_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -52.57 -13.44  -0.15  13.42  65.75 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  55.1491     1.1821   46.65   <2e-16 ***
## Temperature  -0.0452     0.0507   -0.89     0.37    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.5 on 1471 degrees of freedom
## Multiple R-squared:  0.000539,   Adjusted R-squared:  -0.00014 
## F-statistic: 0.794 on 1 and 1471 DF,  p-value: 0.373
# Plot Temperature relationship
ggplot(train_data, aes(x = Temperature, y = Energy.Consumption..kWh.)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Energy Consumption vs Temperature",
       x = "Temperature (°C)",
       y = "Energy Consumption (kWh)")

The temperature-based SLR model shows:

  • A very weak negative relationship (coefficient = -0.0452)
  • No statistical significance (p-value = 0.37)
  • Extremely low R-squared (0.000539), indicating temperature alone explains virtually none of the variance in energy consumption
  • The nearly flat regression line and scattered points suggest that the relationship between temperature and energy consumption might be non-linear, or that other factors may be more important

SLR with Building Size

Next, we’ll examine how building size relates to energy consumption, as larger buildings might be expected to consume more energy for lighting, heating, and cooling.

# SLR with Building Size
slr_size = lm(Energy.Consumption..kWh. ~ Building.Size.m.2., data = train_data)
summary(slr_size)
## 
## Call:
## lm(formula = Energy.Consumption..kWh. ~ Building.Size.m.2., data = train_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -51.73 -13.17  -0.46  13.45  64.04 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        56.77997    1.30804   43.41   <2e-16 ***
## Building.Size.m.2. -0.00222    0.00104   -2.14    0.032 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.5 on 1471 degrees of freedom
## Multiple R-squared:  0.00311,    Adjusted R-squared:  0.00243 
## F-statistic: 4.59 on 1 and 1471 DF,  p-value: 0.0324
# Plot Building Size relationship
ggplot(train_data, aes(x = Building.Size.m.2., y = Energy.Consumption..kWh.)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Energy Consumption vs Building Size",
       x = "Building Size (m²)",
       y = "Energy Consumption (kWh)")

The building size model reveals:

  • A significant but very small negative relationship (coefficient = -0.00222, p-value = 0.032)
  • While statistically significant, the practical significance is minimal given the extremely low R-squared (0.00311)
  • The F-statistic (4.59) indicates the model is marginally better than a null model, but still explains very little variation
  • Similar to previous models, the scattered points and nearly flat line suggest that building size alone is not a strong predictor of energy consumption

SLR with Lighting Consumption

Finally, we’ll analyze the relationship between lighting consumption and total energy consumption, as lighting is typically a significant component of building energy use.

# SLR with Lighting Consumption
slr_light = lm(Energy.Consumption..kWh. ~ Lighting.Consumption..kWh., data = train_data)
summary(slr_light)
## 
## Call:
## lm(formula = Energy.Consumption..kWh. ~ Lighting.Consumption..kWh., 
##     data = train_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -52.52 -13.37  -0.26  13.50  65.12 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 53.7659     1.2725   42.25   <2e-16 ***
## Lighting.Consumption..kWh.   0.0382     0.1031    0.37     0.71    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.5 on 1471 degrees of freedom
## Multiple R-squared:  9.35e-05,   Adjusted R-squared:  -0.000586 
## F-statistic: 0.138 on 1 and 1471 DF,  p-value: 0.711
# Plot Lighting Consumption relationship
ggplot(train_data, aes(x = Lighting.Consumption..kWh., y = Energy.Consumption..kWh.)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Energy Consumption vs Lighting Consumption",
       x = "Lighting Consumption (kWh)",
       y = "Energy Consumption (kWh)")

The lighting consumption model shows:

  • A very weak positive relationship (coefficient = 0.0382)
  • No statistical significance (p-value = 0.71)
  • The lowest R-squared of all models (9.35e-05), indicating practically no explanatory power
  • The visualization confirms the lack of relationship with widely scattered points and a nearly horizontal regression line

Summary of SLR Analyses

After examining four different predictors (HVAC, Temperature, Building Size, and Lighting Consumption) through simple linear regression:

  1. None of the individual predictors explains more than 0.5% of the variation in energy consumption (the largest R-squared, for building size, is 0.00311)
  2. Only building size showed statistical significance, but with minimal practical importance
  3. The consistently flat regression lines and scattered points across all analyses suggest that:
    • Multiple factors might need to be considered simultaneously
    • The data might need transformation or different modeling approaches

These findings strongly suggest we should proceed with multiple regression analysis and consider non-linear relationships or interactions between variables.
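
The four separate fits above follow the same pattern, so they could be collected in one table. The sketch below loops over predictor names with `reformulate()` and gathers the slope, R-squared, and p-value from each fit. The data are simulated and the column names `a`, `b`, `c` are placeholders for the report's predictors.

```r
# Simulated stand-in data: response y and three unrelated predictors.
set.seed(3)
n <- 400
sim <- data.frame(y = rnorm(n, 54, 19), a = rnorm(n), b = rnorm(n), c = rnorm(n))

predictors <- setdiff(names(sim), "y")

# Fit one simple linear regression per predictor and keep the key statistics.
slr_table <- do.call(rbind, lapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "y"), data = sim)
  s <- summary(fit)
  data.frame(predictor = p,
             slope     = coef(fit)[[2]],
             r_squared = s$r.squared,
             p_value   = coef(s)[2, "Pr(>|t|)"])
}))

# Sort so the strongest single predictor appears first.
print(slr_table[order(-slr_table$r_squared), ])
```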

Multiple Linear Regression Analysis

Given the poor performance of the simple linear regression models, let’s extend our analysis to include multiple predictors that might better explain the variation in energy consumption.

Initial Full Model

We’ll start by including all relevant numeric predictors to see their combined effect on energy consumption:

# Fit full MLR model
mlr_full = lm(Energy.Consumption..kWh. ~ Temperature + Solar.Irradiance + 
               HVAC.Consumption..kWh. + Lighting.Consumption..kWh. + 
               Energy.Price....kWh. + Building.Age..years. + Building.Size.m.2. +
               Peak.Demand.Reduction.Indicator, data = train_data)
summary(mlr_full)
## 
## Call:
## lm(formula = Energy.Consumption..kWh. ~ Temperature + Solar.Irradiance + 
##     HVAC.Consumption..kWh. + Lighting.Consumption..kWh. + Energy.Price....kWh. + 
##     Building.Age..years. + Building.Size.m.2. + Peak.Demand.Reduction.Indicator, 
##     data = train_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -52.07 -13.39  -0.26  13.34  64.43 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     58.66241    3.37233   17.40   <2e-16 ***
## Temperature                     -0.03829    0.05084   -0.75     0.45    
## Solar.Irradiance                -0.00320    0.00508   -0.63     0.53    
## HVAC.Consumption..kWh.          -0.05800    0.07622   -0.76     0.45    
## Lighting.Consumption..kWh.       0.02953    0.10344    0.29     0.78    
## Energy.Price....kWh.            -1.27263   10.12715   -0.13     0.90    
## Building.Age..years.             0.03258    0.05289    0.62     0.54    
## Building.Size.m.2.              -0.00226    0.00104   -2.18     0.03 *  
## Peak.Demand.Reduction.Indicator -1.61705    1.47761   -1.09     0.27    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.5 on 1464 degrees of freedom
## Multiple R-squared:  0.00547,    Adjusted R-squared:  3.39e-05 
## F-statistic: 1.01 on 8 and 1464 DF,  p-value: 0.429

The full multiple regression model results show:

  • Building.Size.m.2. is the only significant predictor (p < 0.05)
  • The overall model explains only 0.547% of the variance (R-squared = 0.00547)
  • The F-statistic (1.01) with p-value = 0.429 indicates the model is not significantly better than the null model

Checking for Multicollinearity

Before making any conclusions about the model, we should check for multicollinearity among predictors:

library(car)
vif(mlr_full)
##                     Temperature                Solar.Irradiance 
##                           1.006                           1.006 
##          HVAC.Consumption..kWh.      Lighting.Consumption..kWh. 
##                           1.003                           1.006 
##            Energy.Price....kWh.            Building.Age..years. 
##                           1.005                           1.002 
##              Building.Size.m.2. Peak.Demand.Reduction.Indicator 
##                           1.005                           1.007

The VIF analysis shows:

  • All VIF values are close to 1 (ranging from 1.002 to 1.007)
  • There is no significant multicollinearity among predictors
  • No need to remove variables based on VIF values
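
For intuition, a VIF can be computed by hand: regress each predictor on the others and take 1 / (1 - R²). The sketch below does this on simulated data in which `x3` is deliberately built to be collinear with `x1`. `manual_vif` is a hypothetical helper, not part of the car package.

```r
# Simulated predictors; x3 is constructed to be strongly correlated with x1.
set.seed(11)
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$x3 <- 0.9 * sim$x1 + rnorm(n, sd = 0.3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the rest.
manual_vif <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- manual_vif(sim, c("x1", "x2", "x3"))
print(round(vifs, 2))
```

Values near 1, as in the report's output, mean each predictor is almost uncorrelated with the rest; the simulated `x1` and `x3` show what a problematic case looks like.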

Despite the low VIF values, let’s try a reduced model that drops Lighting.Consumption..kWh. and Building.Size.m.2. and keeps the predictors we consider most theoretically relevant:

mlr_reduced = lm(Energy.Consumption..kWh. ~ Temperature + Solar.Irradiance + 
                  HVAC.Consumption..kWh. + Energy.Price....kWh. + 
                  Building.Age..years. + Peak.Demand.Reduction.Indicator,
                  data = train_data)
vif(mlr_reduced)
##                     Temperature                Solar.Irradiance 
##                           1.004                           1.002 
##          HVAC.Consumption..kWh.            Energy.Price....kWh. 
##                           1.002                           1.002 
##            Building.Age..years. Peak.Demand.Reduction.Indicator 
##                           1.002                           1.005

The reduced model maintains low VIF values, confirming the absence of multicollinearity.

Influential Points Analysis

Next, let’s examine whether our model is being affected by influential observations:

# Calculate Cook's distance
cooks_d = cooks.distance(mlr_reduced)
plot(cooks_d, type = "h", main = "Cook's Distance Plot")
abline(h = 4/length(cooks_d), col = "red")

# Identify influential points
influential = which(cooks_d > 4/length(cooks_d))
length(influential)
## [1] 65

The Cook’s distance plot reveals:

  • 65 of the 1,473 training observations (about 4.4%) were identified as influential, with Cook’s distance above the 4/n threshold line
  • These points might be affecting our model estimates

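Let’s refit the model without these influential points. Note that the points were flagged using mlr_reduced, while the refit below uses the full set of predictors: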

# Fit model without influential points
mlr_no_influential = lm(Energy.Consumption..kWh. ~ Temperature + Solar.Irradiance + 
               HVAC.Consumption..kWh. + Lighting.Consumption..kWh. + 
               Energy.Price....kWh. + Building.Age..years. + Building.Size.m.2. +
               Peak.Demand.Reduction.Indicator,
               data = train_data[-influential,])
summary(mlr_no_influential)
## 
## Call:
## lm(formula = Energy.Consumption..kWh. ~ Temperature + Solar.Irradiance + 
##     HVAC.Consumption..kWh. + Lighting.Consumption..kWh. + Energy.Price....kWh. + 
##     Building.Age..years. + Building.Size.m.2. + Peak.Demand.Reduction.Indicator, 
##     data = train_data[-influential, ])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.27 -12.50   0.03  12.63  59.95 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     56.329752   3.149725   17.88   <2e-16 ***
## Temperature                     -0.010208   0.047955   -0.21    0.831    
## Solar.Irradiance                -0.004330   0.004742   -0.91    0.361    
## HVAC.Consumption..kWh.           0.018131   0.070710    0.26    0.798    
## Lighting.Consumption..kWh.      -0.014866   0.094863   -0.16    0.875    
## Energy.Price....kWh.            -1.014870   9.437111   -0.11    0.914    
## Building.Age..years.             0.059129   0.049599    1.19    0.233    
## Building.Size.m.2.              -0.001787   0.000958   -1.87    0.062 .  
## Peak.Demand.Reduction.Indicator -2.675955   1.393900   -1.92    0.055 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.6 on 1399 degrees of freedom
## Multiple R-squared:  0.00664,    Adjusted R-squared:  0.000959 
## F-statistic: 1.17 on 8 and 1399 DF,  p-value: 0.314

After removing influential points and using all predictors:

  • Building.Size.m.2. (p = 0.062) and Peak.Demand.Reduction.Indicator (p = 0.055) are marginally significant at the 0.10 level
  • None of the other predictors shows statistical significance, with very high p-values ranging from 0.233 to 0.914
  • The R-squared value remains extremely low (0.00664), indicating the model explains less than 1% of the variance in energy consumption
  • The F-statistic of 1.17 with p-value = 0.314 suggests that the overall model is not significantly better than using the mean alone to predict energy consumption
  • The residual standard error of 17.6 indicates considerable unexplained variation in the response variable
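
To judge how much the flagged points actually change the fit, the coefficients and residual standard error can be placed side by side. The sketch below uses simulated data and hypothetical names. It also builds an explicit `keep` index, because negative indexing with an empty vector (as in `data[-flagged, ]` when nothing is flagged) would return zero rows.

```r
# Simulated data with a modest linear signal.
set.seed(5)
n <- 200
sim <- data.frame(x = rnorm(n))
sim$y <- 3 + 0.5 * sim$x + rnorm(n)

fit_all <- lm(y ~ x, data = sim)

# Flag observations whose Cook's distance exceeds the 4/n rule of thumb.
cd <- cooks.distance(fit_all)
flagged <- which(cd > 4 / length(cd))

# Keep everything that was not flagged (safe even if 'flagged' is empty).
keep <- setdiff(seq_len(n), flagged)
fit_trim <- lm(y ~ x, data = sim[keep, ])

comparison <- rbind(
  all_points = c(coef(fit_all),  sigma = sigma(fit_all)),
  trimmed    = c(coef(fit_trim), sigma = sigma(fit_trim))
)
print(round(comparison, 3))
```

Dropping points with large Cook's distance tends to lower the residual standard error simply because large residuals are removed, so a smaller value after trimming is not by itself evidence of a better model.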

Model Diagnostics

Let’s check the assumptions of our model:

# Check normality assumption
qqnorm(resid(mlr_no_influential))
qqline(resid(mlr_no_influential))

# Check constant variance
plot(fitted(mlr_no_influential), resid(mlr_no_influential),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted Values")
abline(h = 0, col = "red")

The diagnostic plots reveal:

  • The Q-Q plot does not show clearly normal residuals, with some deviation in the tails
  • The residuals vs. fitted plot shows most points clustered around fitted values of about 54. This clustering mainly reflects how little the fitted values vary, and the spread of the residuals may also suggest heteroscedasticity

These results suggest we might need to consider:

  1. Non-linear relationships between predictors and response
  2. Interaction effects between predictors
  3. Additional important variables not currently in the model
  4. Different modeling approaches beyond linear regression

Response transformation

# Perform Box-Cox transformation analysis
library(MASS)
bc = boxcox(mlr_no_influential)

lambda = bc$x[which.max(bc$y)]
print(lambda)
## [1] 0.9091

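Since lambda is about 0.91, which is close to 1, the Box-Cox analysis suggests that little or no transformation of the response is needed (a log transformation would correspond to lambda = 0). For comparison, we still fit a log-transformed model alongside the untransformed one to see which performs better.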
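
Whether a lambda near 1 is statistically indistinguishable from 1 can be checked with an approximate 95% interval from the Box-Cox profile log-likelihood. The sketch below uses simulated data that are linear on the original scale. The object names are hypothetical, and the model is fitted with `y = TRUE` so that `boxcox()` does not need to refit it.

```r
library(MASS)

# Simulated positive response that is linear in x on the original scale.
set.seed(9)
n <- 300
sim <- data.frame(x = runif(n, 0, 10))
sim$y <- 50 + 2 * sim$x + rnorm(n, sd = 5)

fit <- lm(y ~ x, data = sim, y = TRUE)

# Profile log-likelihood over a fine grid of lambda values (no plot).
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas within qchisq(0.95, 1) / 2 of the maximum.
inside <- bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2
lambda_ci <- range(bc$x[inside])

print(c(estimate = lambda_hat, lower = lambda_ci[1], upper = lambda_ci[2]))
```

If the interval contains 1, the untransformed response is a reasonable choice; if it contains 0, a log transformation would also be defensible.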

# Log model
mlr_log = lm(log(Energy.Consumption..kWh.) ~ Temperature + Solar.Irradiance + 
               HVAC.Consumption..kWh. + Lighting.Consumption..kWh. + 
               Energy.Price....kWh. + Building.Age..years. + Building.Size.m.2. +
               Peak.Demand.Reduction.Indicator,
               data = train_data[-influential,])
summary(mlr_log)
## 
## Call:
## lm(formula = log(Energy.Consumption..kWh.) ~ Temperature + Solar.Irradiance + 
##     HVAC.Consumption..kWh. + Lighting.Consumption..kWh. + Energy.Price....kWh. + 
##     Building.Age..years. + Building.Size.m.2. + Peak.Demand.Reduction.Indicator, 
##     data = train_data[-influential, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2325 -0.1987  0.0616  0.2731  0.7975 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      3.9796932  0.0677435   58.75   <2e-16 ***
## Temperature                     -0.0001270  0.0010314   -0.12    0.902    
## Solar.Irradiance                -0.0001093  0.0001020   -1.07    0.284    
## HVAC.Consumption..kWh.          -0.0000576  0.0015208   -0.04    0.970    
## Lighting.Consumption..kWh.      -0.0004536  0.0020403   -0.22    0.824    
## Energy.Price....kWh.            -0.0153366  0.2029711   -0.08    0.940    
## Building.Age..years.             0.0013022  0.0010668    1.22    0.222    
## Building.Size.m.2.              -0.0000377  0.0000206   -1.83    0.068 .  
## Peak.Demand.Reduction.Indicator -0.0252638  0.0299797   -0.84    0.400    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.379 on 1399 degrees of freedom
## Multiple R-squared:  0.00478,    Adjusted R-squared:  -0.000907 
## F-statistic: 0.841 on 8 and 1399 DF,  p-value: 0.567
# Check diagnostics for both models
# Original model diagnostics
par(mfrow=c(1,2))
plot(mlr_no_influential, which=1)
plot(mlr_no_influential, which=2)

# Log transformed model diagnostics
par(mfrow=c(1,2))
plot(mlr_log, which=1)
plot(mlr_log, which=2)

# Formal tests for both models
library(lmtest)
# Original model tests
shapiro.test(resid(mlr_no_influential))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(mlr_no_influential)
## W = 1, p-value = 0.007
bptest(mlr_no_influential)
## 
##  studentized Breusch-Pagan test
## 
## data:  mlr_no_influential
## BP = 17, df = 8, p-value = 0.03
# Log transformed model tests
shapiro.test(resid(mlr_log))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(mlr_log)
## W = 0.94, p-value <2e-16
bptest(mlr_log)
## 
##  studentized Breusch-Pagan test
## 
## data:  mlr_log
## BP = 12, df = 8, p-value = 0.2

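The formal tests give a mixed picture. The log model shows no evidence of heteroscedasticity in the Breusch-Pagan test (p = 0.2, versus p = 0.03 for the original model), but it departs much further from normality in the Shapiro-Wilk test (W = 0.94, p < 2e-16, versus p = 0.007), and its residuals are strongly left-skewed (minimum -2.23, maximum 0.80). Taking these tests together with the fitted vs. residuals plots, the Q-Q plots, and the Box-Cox lambda near 1, we conclude that the original model without influential points is the better choice.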
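
To make the Breusch-Pagan output easier to read, the studentized version of the test (the default in `lmtest::bptest`) can be reproduced by hand: regress the squared residuals on the predictors and compare n × R² with a chi-squared distribution whose degrees of freedom equal the number of predictors. A minimal sketch on simulated data with built-in heteroscedasticity follows; all names are hypothetical.

```r
# Simulated data whose error spread grows with x1.
set.seed(21)
n <- 400
sim <- data.frame(x1 = runif(n, 1, 10), x2 = rnorm(n))
sim$y <- 2 + sim$x1 + sim$x2 + rnorm(n, sd = 0.5 * sim$x1)

fit <- lm(y ~ x1 + x2, data = sim)

# Auxiliary regression of squared residuals on the same predictors.
aux <- lm(resid(fit)^2 ~ x1 + x2, data = sim)

bp_stat <- n * summary(aux)$r.squared          # studentized (Koenker) statistic
bp_df   <- length(coef(aux)) - 1               # one df per predictor
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(c(BP = bp_stat, df = bp_df, p_value = bp_p))
```

This mirrors the `df = 8` reported above for the eight-predictor models.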

Interaction

# Test interactions between temperature and HVAC consumption
# (since temperature likely affects HVAC usage)
mlr_interaction1 = lm(Energy.Consumption..kWh. ~ Temperature * HVAC.Consumption..kWh. + 
                      Solar.Irradiance + Lighting.Consumption..kWh. + 
                      Energy.Price....kWh. + Building.Age..years. + Building.Size.m.2. +
                      Peak.Demand.Reduction.Indicator,
                      data = train_data[-influential,])

# Test interactions between temperature and building size
# (since larger buildings might be more affected by temperature changes)
mlr_interaction2 = lm(Energy.Consumption..kWh. ~ Temperature * Building.Size.m.2. + 
                      HVAC.Consumption..kWh. + Solar.Irradiance + 
                      Lighting.Consumption..kWh. + Energy.Price....kWh. + 
                      Building.Age..years. + Peak.Demand.Reduction.Indicator,
                      data = train_data[-influential,])

# Test interactions between solar irradiance and HVAC consumption
# (since solar heat might affect HVAC needs)
mlr_interaction3 = lm(Energy.Consumption..kWh. ~ Solar.Irradiance * HVAC.Consumption..kWh. + 
                      Temperature + Lighting.Consumption..kWh. + 
                      Energy.Price....kWh. + Building.Age..years. + Building.Size.m.2. +
                      Peak.Demand.Reduction.Indicator,
                      data = train_data[-influential,])

# Compare models using anova
anova(mlr_no_influential, mlr_interaction1)
## Analysis of Variance Table
## 
## Model 1: Energy.Consumption..kWh. ~ Temperature + Solar.Irradiance + HVAC.Consumption..kWh. + 
##     Lighting.Consumption..kWh. + Energy.Price....kWh. + Building.Age..years. + 
##     Building.Size.m.2. + Peak.Demand.Reduction.Indicator
## Model 2: Energy.Consumption..kWh. ~ Temperature * HVAC.Consumption..kWh. + 
##     Solar.Irradiance + Lighting.Consumption..kWh. + Energy.Price....kWh. + 
##     Building.Age..years. + Building.Size.m.2. + Peak.Demand.Reduction.Indicator
##   Res.Df    RSS Df Sum of Sq    F Pr(>F)
## 1   1399 433357                         
## 2   1398 433308  1      48.8 0.16   0.69
anova(mlr_no_influential, mlr_interaction2)
## Analysis of Variance Table
## 
## Model 1: Energy.Consumption..kWh. ~ Temperature + Solar.Irradiance + HVAC.Consumption..kWh. + 
##     Lighting.Consumption..kWh. + Energy.Price....kWh. + Building.Age..years. + 
##     Building.Size.m.2. + Peak.Demand.Reduction.Indicator
## Model 2: Energy.Consumption..kWh. ~ Temperature * Building.Size.m.2. + 
##     HVAC.Consumption..kWh. + Solar.Irradiance + Lighting.Consumption..kWh. + 
##     Energy.Price....kWh. + Building.Age..years. + Peak.Demand.Reduction.Indicator
##   Res.Df    RSS Df Sum of Sq    F Pr(>F)
## 1   1399 433357                         
## 2   1398 433329  1        28 0.09   0.76
anova(mlr_no_influential, mlr_interaction3)
## Analysis of Variance Table
## 
## Model 1: Energy.Consumption..kWh. ~ Temperature + Solar.Irradiance + HVAC.Consumption..kWh. + 
##     Lighting.Consumption..kWh. + Energy.Price....kWh. + Building.Age..years. + 
##     Building.Size.m.2. + Peak.Demand.Reduction.Indicator
## Model 2: Energy.Consumption..kWh. ~ Solar.Irradiance * HVAC.Consumption..kWh. + 
##     Temperature + Lighting.Consumption..kWh. + Energy.Price....kWh. + 
##     Building.Age..years. + Building.Size.m.2. + Peak.Demand.Reduction.Indicator
##   Res.Df    RSS Df Sum of Sq    F Pr(>F)
## 1   1399 433357                         
## 2   1398 433282  1        75 0.24   0.62
# Summary of each interaction model
summary(mlr_interaction1)
## 
## Call:
## lm(formula = Energy.Consumption..kWh. ~ Temperature * HVAC.Consumption..kWh. + 
##     Solar.Irradiance + Lighting.Consumption..kWh. + Energy.Price....kWh. + 
##     Building.Age..years. + Building.Size.m.2. + Peak.Demand.Reduction.Indicator, 
##     data = train_data[-influential, ])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.31 -12.55  -0.11  12.68  59.95 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        57.230319   3.882802   14.74   <2e-16 ***
## Temperature                        -0.052732   0.117398   -0.45    0.653    
## HVAC.Consumption..kWh.             -0.038111   0.158389   -0.24    0.810    
## Solar.Irradiance                   -0.004364   0.004744   -0.92    0.358    
## Lighting.Consumption..kWh.         -0.015065   0.094893   -0.16    0.874    
## Energy.Price....kWh.               -0.993956   9.440101   -0.11    0.916    
## Building.Age..years.                0.058610   0.049631    1.18    0.238    
## Building.Size.m.2.                 -0.001795   0.000958   -1.87    0.061 .  
## Peak.Demand.Reduction.Indicator    -2.679632   1.394351   -1.92    0.055 .  
## Temperature:HVAC.Consumption..kWh.  0.002731   0.006881    0.40    0.692    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.6 on 1398 degrees of freedom
## Multiple R-squared:  0.00675,    Adjusted R-squared:  0.000357 
## F-statistic: 1.06 on 9 and 1398 DF,  p-value: 0.393
summary(mlr_interaction2)
## 
## Call:
## lm(formula = Energy.Consumption..kWh. ~ Temperature * Building.Size.m.2. + 
##     HVAC.Consumption..kWh. + Solar.Irradiance + Lighting.Consumption..kWh. + 
##     Energy.Price....kWh. + Building.Age..years. + Peak.Demand.Reduction.Indicator, 
##     data = train_data[-influential, ])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -49.3  -12.4    0.0   12.7   59.9 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     57.0653824  3.9889575   14.31   <2e-16 ***
## Temperature                     -0.0460324  0.1284298   -0.36    0.720    
## Building.Size.m.2.              -0.0024316  0.0023477   -1.04    0.301    
## HVAC.Consumption..kWh.           0.0176137  0.0707536    0.25    0.803    
## Solar.Irradiance                -0.0042491  0.0047510   -0.89    0.371    
## Lighting.Consumption..kWh.      -0.0143885  0.0949073   -0.15    0.880    
## Energy.Price....kWh.            -1.0414412  9.4405938   -0.11    0.912    
## Building.Age..years.             0.0597152  0.0496530    1.20    0.229    
## Peak.Demand.Reduction.Indicator -2.6699617  1.3944959   -1.91    0.056 .  
## Temperature:Building.Size.m.2.   0.0000303  0.0001008    0.30    0.764    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.6 on 1398 degrees of freedom
## Multiple R-squared:  0.0067, Adjusted R-squared:  0.000309 
## F-statistic: 1.05 on 9 and 1398 DF,  p-value: 0.399
summary(mlr_interaction3)
## 
## Call:
## lm(formula = Energy.Consumption..kWh. ~ Solar.Irradiance * HVAC.Consumption..kWh. + 
##     Temperature + Lighting.Consumption..kWh. + Energy.Price....kWh. + 
##     Building.Age..years. + Building.Size.m.2. + Peak.Demand.Reduction.Indicator, 
##     data = train_data[-influential, ])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.30 -12.48   0.01  12.64  59.97 
## 
## Coefficients:
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                             57.629967   4.112461   14.01   <2e-16
## Solar.Irradiance                        -0.010259   0.012953   -0.79    0.428
## HVAC.Consumption..kWh.                  -0.059565   0.173057   -0.34    0.731
## Temperature                             -0.010577   0.047974   -0.22    0.826
## Lighting.Consumption..kWh.              -0.014596   0.094890   -0.15    0.878
## Energy.Price....kWh.                    -1.177472   9.445454   -0.12    0.901
## Building.Age..years.                     0.059189   0.049612    1.19    0.233
## Building.Size.m.2.                      -0.001773   0.000959   -1.85    0.065
## Peak.Demand.Reduction.Indicator         -2.687065   1.394461   -1.93    0.054
## Solar.Irradiance:HVAC.Consumption..kWh.  0.000358   0.000728    0.49    0.623
##                                            
## (Intercept)                             ***
## Solar.Irradiance                           
## HVAC.Consumption..kWh.                     
## Temperature                                
## Lighting.Consumption..kWh.                 
## Energy.Price....kWh.                       
## Building.Age..years.                       
## Building.Size.m.2.                      .  
## Peak.Demand.Reduction.Indicator         .  
## Solar.Irradiance:HVAC.Consumption..kWh.    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.6 on 1398 degrees of freedom
## Multiple R-squared:  0.00681,    Adjusted R-squared:  0.000417 
## F-statistic: 1.07 on 9 and 1398 DF,  p-value: 0.386
# Rows flagged as influential are dropped before plotting. Colour is mapped
# inside geom_point() only, so geom_smooth() fits one overall linear trend
# and does not receive (and drop) a continuous colour aesthetic.

# Temperature x HVAC interaction plot
ggplot(train_data[-influential, ],
       aes(x = Temperature, y = Energy.Consumption..kWh.)) +
  geom_point(aes(color = HVAC.Consumption..kWh.), alpha = 0.5) +
  geom_smooth(method = "lm") +
  labs(title = "Interaction between Temperature and HVAC Consumption",
       x = "Temperature",
       y = "Energy Consumption (kWh)",
       color = "HVAC Consumption (kWh)")

# Temperature x Building Size interaction plot
ggplot(train_data[-influential, ],
       aes(x = Temperature, y = Energy.Consumption..kWh.)) +
  geom_point(aes(color = Building.Size.m.2.), alpha = 0.5) +
  geom_smooth(method = "lm") +
  labs(title = "Interaction between Temperature and Building Size",
       x = "Temperature",
       y = "Energy Consumption (kWh)",
       color = "Building Size (m²)")

# Solar Irradiance x HVAC interaction plot
ggplot(train_data[-influential, ],
       aes(x = Solar.Irradiance, y = Energy.Consumption..kWh.)) +
  geom_point(aes(color = HVAC.Consumption..kWh.), alpha = 0.5) +
  geom_smooth(method = "lm") +
  labs(title = "Interaction between Solar Irradiance and HVAC Consumption",
       x = "Solar Irradiance",
       y = "Energy Consumption (kWh)",
       color = "HVAC Consumption (kWh)")

Interaction Model Results

None of the tested interaction terms was statistically significant at the 0.05 level:

  • Temperature × HVAC Consumption: p = 0.692
  • Temperature × Building Size: p = 0.764
  • Solar Irradiance × HVAC Consumption: p = 0.623

ANOVA tests comparing models with and without interactions demonstrated no significant improvement:

  • Temperature × HVAC Consumption: F = 0.16, p = 0.69
  • Temperature × Building Size: F = 0.09, p = 0.76
  • Solar Irradiance × HVAC Consumption: F = 0.24, p = 0.62
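The comparison of a main-effects model against a model with one added interaction can be sketched as follows. This is a minimal illustration on simulated data, not the report's actual code: the data frame `sim`, its shortened column names, and the object names `fit_main` and `fit_int` are all hypothetical. It also shows why the ANOVA p-values match the coefficient p-values listed earlier: when a single term is added, the partial F statistic equals the square of that term's t statistic.

```r
# Simulated stand-in for the training data; the column names are shortened
# and the values are random, so the printed numbers are illustrative only.
set.seed(42)
n <- 500
sim <- data.frame(
  temperature = rnorm(n, mean = 20, sd = 5),
  hvac        = rnorm(n, mean = 10, sd = 3)
)
sim$energy <- 50 + rnorm(n, sd = 17)  # response unrelated to the predictors

# Reduced model: main effects only
fit_main <- lm(energy ~ temperature + hvac, data = sim)
# Full model: adds the Temperature x HVAC interaction term
fit_int  <- lm(energy ~ temperature * hvac, data = sim)

# Partial F-test for the single added term
cmp <- anova(fit_main, fit_int)
print(cmp)

# With one added coefficient, F equals the squared t statistic
t_int <- summary(fit_int)$coefficients["temperature:hvac", "t value"]
f_int <- cmp$F[2]
all.equal(f_int, t_int^2)
```

Because the two tests are equivalent in this setting, the ANOVA results confirm the coefficient tests but do not provide independent evidence.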

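The R-squared values remained consistently low (approximately 0.006-0.007) across all interaction models, so each model explains less than 1% of the variance in energy consumption. For the Solar Irradiance × HVAC Consumption model shown above, the adjusted R-squared is 0.000417 and the overall F-test is not significant (F = 1.07, p = 0.386).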

Visualization Analysis

The visualization of these relationships revealed:

  • Data points showed high scatter with nearly flat regression lines
  • No discernible patterns of interaction effects were visible
  • The data exhibited substantial variability

These findings suggest that adding interaction terms did not improve the model's fit. There are several possible interpretations: the predictors may act independently of one another, they may be related in a non-linear fashion that a linear interaction term cannot capture, or energy consumption may be driven by unmeasured factors not present in our dataset.
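One way to probe the non-linear interpretation is to compare a straight-line fit with a quadratic fit. The sketch below uses simulated data with an assumed U-shaped temperature response; the variable names and the shape of the relationship are hypothetical and are not taken from the dataset.

```r
# Simulated example: usage rises at both cold and hot extremes (assumed shape)
set.seed(1)
n <- 400
temperature <- runif(n, 5, 35)
energy <- 40 + 0.08 * (temperature - 20)^2 + rnorm(n, sd = 5)

# A linear fit misses a symmetric U-shape almost entirely
fit_lin  <- lm(energy ~ temperature)
# Adding a quadratic term via orthogonal polynomials
fit_quad <- lm(energy ~ poly(temperature, 2))

# Partial F-test: does the quadratic term improve the fit?
cmp_nl <- anova(fit_lin, fit_quad)
print(cmp_nl)

r2 <- c(linear    = summary(fit_lin)$r.squared,
        quadratic = summary(fit_quad)$r.squared)
print(r2)
```

If the same comparison on the real data showed no improvement, the unmeasured-factor explanation would become more plausible than the non-linear one.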

Discussion

Dataset Limitations

  1. Temporal Coverage:
    • The analysis used a sample of only about 2,000 observations, covering a single year (2023).
    • Seasonal patterns over multiple years cannot be analyzed, and long-term trends in energy consumption are not captured.

Potential Biases

  1. Sample Selection Bias:
    • The data are limited to Southern California, so the results may not generalize to other regions
    • Random sampling of 2000 observations might not fully represent all building types
    • Removal of influential points could affect model generalizability
  2. Temporal Bias:
    • 2023 might have had unusual weather patterns
    • Energy consumption patterns might be affected by post-COVID changes
    • Single-year data might not capture typical usage patterns
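The concern about removing influential points can be examined with a simple sensitivity check: fit the model with and without the flagged rows and compare the coefficients. The sketch below runs on simulated data and assumes a Cook's distance cutoff of 4/n; the report does not state here how `influential` was computed, so the cutoff and all object names are assumptions.

```r
# Simulated data with a few injected extreme observations (illustrative only)
set.seed(7)
n <- 300
size   <- runif(n, 500, 5000)
energy <- 60 - 0.002 * size + rnorm(n, sd = 15)
energy[1:5] <- energy[1:5] + 100
sim <- data.frame(size, energy)

fit_all <- lm(energy ~ size, data = sim)

# Flag influential rows with Cook's distance (4/n is a common rule of thumb,
# assumed here rather than taken from the report)
cd <- cooks.distance(fit_all)
influential <- which(cd > 4 / n)

# Guard: sim[-integer(0), ] would return zero rows when nothing is flagged
keep <- if (length(influential) > 0) sim[-influential, ] else sim
fit_trim <- lm(energy ~ size, data = keep)

# Side-by-side coefficients show how much the removal shifts the estimates
coef_cmp <- rbind(all = coef(fit_all), trimmed = coef(fit_trim))
print(coef_cmp)
```

Large shifts between the two rows would indicate that conclusions depend on the removed points and should be reported for both fits.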

Future Research Directions

  1. Data Collection Improvements: Collect multi-year data to capture long-term trends
  2. Modeling Enhancements:
    • Explore advanced machine learning techniques
    • Develop time series models for better prediction
  3. Additional Analysis Areas:
    • Study interaction between building characteristics and weather patterns
    • Analyze cost-effectiveness of energy-saving measures
    • Investigate the impact of different energy management strategies
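As a rough indication of what a time series approach might look like, the sketch below fits an AR(1) model with harmonic regressors for a 24-hour cycle using `arima()` from base R's stats package. The series is simulated, and the daily cycle, noise level, and model order are assumptions chosen for illustration, not properties of the dataset.

```r
# Simulated hourly series: 60 days with an assumed daily cycle and AR(1) noise
set.seed(123)
hours  <- 0:(24 * 60 - 1)
daily  <- 10 * sin(2 * pi * hours / 24)
noise  <- as.numeric(arima.sim(list(ar = 0.6), n = length(hours), sd = 3))
energy <- 50 + daily + noise

# Harmonic regressors capture the 24-hour cycle
xreg <- cbind(sin24 = sin(2 * pi * hours / 24),
              cos24 = cos(2 * pi * hours / 24))

# Hold out the final 24 hours for evaluation
train <- 1:(length(hours) - 24)
test  <- (length(hours) - 23):length(hours)

fit_ts <- arima(energy[train], order = c(1, 0, 0), xreg = xreg[train, ])
fc     <- predict(fit_ts, n.ahead = 24, newxreg = xreg[test, ])

# Compare against a naive forecast that always predicts the training mean
rmse_ts   <- sqrt(mean((energy[test] - as.numeric(fc$pred))^2))
rmse_mean <- sqrt(mean((energy[test] - mean(energy[train]))^2))
print(c(arima_harmonic = rmse_ts, mean_only = rmse_mean))
```

A model of this kind uses the hourly ordering of the records, which the cross-sectional regressions in this report ignore.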

Despite these limitations, the analysis offers a useful starting point for building energy management. Our linear models showed very limited predictive power (R-squared below 0.01), and no predictor reached significance at the 0.05 level in the interaction model reported above; the closest were the Peak Demand Reduction Indicator (p = 0.054) and Building Size (p = 0.065). Future work should focus on gathering more detailed data and exploring more sophisticated modeling approaches to better understand and predict building energy consumption patterns.